
    Fair Column Subset Selection

    We consider the problem of fair column subset selection. In particular, we assume that two groups are present in the data, and the chosen column subset must provide a good approximation for both, relative to their respective best rank-k approximations. We show that this fair setting introduces significant challenges: in order to extend known results, one cannot do better than the trivial solution of simply picking twice as many columns as the original methods. We adopt a known approach based on deterministic leverage-score sampling, and show that merely sampling a subset of appropriate size becomes NP-hard in the presence of two groups. Whereas finding a subset of twice the desired size is trivial, we provide an efficient algorithm that achieves the same guarantees with essentially 1.5 times that size. We validate our methods through an extensive set of experiments on real-world data.
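    As background for the leverage-score machinery the abstract builds on, the sketch below computes rank-k leverage scores and selects columns deterministically. This is the standard single-group baseline, not the paper's fair algorithm, and all function names are ours:

```python
import numpy as np

def leverage_scores(A, k):
    """Rank-k leverage scores of the columns of A.

    The score of column j is the squared Euclidean norm of the j-th
    column of V_k^T, where V_k holds the top-k right singular vectors.
    The scores always sum to k.
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :]                       # k x n
    return np.sum(Vk ** 2, axis=0)

def deterministic_css(A, k, c):
    """Pick the c columns of A with the largest rank-k leverage scores."""
    scores = leverage_scores(A, k)
    return np.argsort(scores)[::-1][:c]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
cols = deterministic_css(A, k=5, c=8)
```

    The paper's hardness result concerns choosing such a subset when two groups must be approximated simultaneously; the single-group selection above is the easy case.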

    Off-the-grid: Fast and Effective Hyperparameter Search for Kernel Clustering

    Kernel functions are a powerful tool to enhance the k-means clustering algorithm via the kernel trick. It is known that the parameters of the chosen kernel function can have a dramatic impact on the result. In supervised settings, these can be tuned via cross-validation, but for clustering this is not straightforward and heuristics are usually employed. In this paper we study the impact of kernel parameters on kernel k-means. In particular, we derive a lower bound, tight up to constant factors, below which the parameter of the RBF kernel will render kernel k-means meaningless. We argue that grid search can be ineffective for hyperparameter search in this context and propose an alternative algorithm for this purpose. In addition, we offer an efficient implementation based on fast approximate exponentiation with provable quality guarantees. Our experimental results demonstrate the ability of our method to efficiently reveal a rich and useful set of hyperparameter values.
    Comment: ECML-PKDD 202
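    For reference, kernel k-means can be run entirely from the kernel matrix, since the squared feature-space distance to a cluster mean expands in terms of kernel entries. The sketch below is a minimal illustration of that mechanism; the RBF bandwidth gamma is an arbitrary illustrative value, not the bound or search procedure from the paper:

```python
import numpy as np

def rbf_kernel(X, gamma):
    # K[i, j] = exp(-gamma * ||x_i - x_j||^2)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_kmeans(K, k, n_iter=100, seed=0):
    # Lloyd-style iterations using only the kernel matrix:
    # ||phi(x_i) - mu_c||^2 = K_ii - 2*mean_j K_ij + mean_{j,l} K_jl,
    # with j, l ranging over the points currently in cluster c.
    n = K.shape[0]
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=n)
    for _ in range(n_iter):
        D = np.full((n, k), np.inf)      # inf keeps empty clusters unselectable
        for c in range(k):
            idx = labels == c
            if not idx.any():
                continue
            D[:, c] = (np.diag(K)
                       - 2.0 * K[:, idx].mean(axis=1)
                       + K[np.ix_(idx, idx)].mean())
        new_labels = D.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# two well-separated blobs; a gamma on the wrong scale would flatten K
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(8, 0.3, (10, 2))])
labels = kernel_kmeans(rbf_kernel(X, gamma=0.5), k=2)
```

    The paper's point is precisely that choices of gamma far below a data-dependent threshold make the matrix K nearly constant, so iterations like the above become meaningless.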

    Diversity-aware k-median: Clustering with fair center representation

    We introduce a novel problem for diversity-aware clustering. We assume that the potential cluster centers belong to a set of groups defined by protected attributes, such as ethnicity, gender, etc. We then ask to find a minimum-cost clustering of the data into k clusters so that a specified minimum number of cluster centers is chosen from each group. We thus require that all groups are represented in the clustering solution as cluster centers, according to specified requirements. More precisely, we are given a set of clients C, a set of facilities F, a collection {F_1, ..., F_t} of facility groups F_i ⊆ F, a budget k, and a set of lower-bound thresholds R = {r_1, ..., r_t}, one for each group. The diversity-aware k-median problem asks to find a set S of k facilities such that |S ∩ F_i| ≥ r_i, that is, at least r_i centers in S are from group F_i, and the k-median cost Σ_{c ∈ C} min_{s ∈ S} d(c, s) is minimized. We show that in the general case, where the facility groups may overlap, the diversity-aware k-median problem is NP-hard, fixed-parameter intractable, and inapproximable to any multiplicative factor. On the other hand, when the facility groups are disjoint, approximation algorithms can be obtained by reduction to the matroid median and red-blue median problems. Experimentally, we evaluate our approximation methods for the tractable cases, and present a relaxation-based heuristic for the theoretically intractable case, which can provide high-quality and efficient solutions for real-world datasets.
    Comment: To appear in ECML-PKDD 202
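    The two ingredients of the problem definition, the k-median cost and the group lower bounds, are easy to evaluate for a candidate solution. The helpers below are illustrative names of ours, not code from the paper:

```python
import numpy as np

def kmedian_cost(clients, facilities, S):
    # sum over clients of the distance to the nearest open facility in S
    d = np.linalg.norm(clients[:, None, :] - facilities[S][None, :, :], axis=2)
    return d.min(axis=1).sum()

def is_feasible(S, groups, r, k):
    # |S| = k and at least r_i chosen facilities belong to group F_i
    S = set(S)
    return len(S) == k and all(len(S & set(F_i)) >= r_i
                               for F_i, r_i in zip(groups, r))

clients = np.array([[0.0, 0.0], [1.0, 0.0]])
facilities = np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 0.0], [9.0, 9.0]])
groups = [[0, 1], [2, 3]]   # two facility groups (disjoint in this toy case)
r = [1, 1]                  # at least one center required from each group
S = [0, 2]                  # candidate: one facility from each group
```

    With disjoint groups, as here, the abstract notes the problem reduces to matroid median / red-blue median; overlapping groups are what make it inapproximable.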

    Provable randomized rounding for minimum-similarity diversification

    When searching for information in a data collection, we are often interested not only in finding relevant items, but also in assembling a diverse set, so as to explore different concepts that are present in the data. This problem has been researched extensively. However, finding a set of items with minimal pairwise similarities can be computationally challenging, and most existing works striving for quality guarantees assume that item relatedness is measured by a distance function. Given the widespread use of similarity functions in many domains, we believe this to be an important gap in the literature. In this paper we study the problem of finding a diverse set of items, when item relatedness is measured by a similarity function. We formulate the diversification task using a flexible, broadly applicable minimization objective, consisting of the sum of pairwise similarities of the selected items and a relevance penalty term. To find good solutions we adopt a randomized rounding strategy, which is challenging to analyze because of the cardinality constraint present in our formulation. Even though this obstacle can be overcome using dependent rounding, we show that it is possible to obtain provably good solutions using an independent approach, which is faster, simpler to implement and completely parallelizable. Our analysis relies on a novel bound for the ratio of Poisson-Binomial densities, which is of independent interest and has potential implications for other combinatorial-optimization problems. We leverage this result to design an efficient randomized algorithm that provides a lower-order additive approximation guarantee. We validate our method using several benchmark datasets, and show that it consistently outperforms the greedy approaches that are commonly used in the literature.
    Peer reviewed
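    To make the setup concrete, the sketch below shows one plausible instantiation of such an objective (pairwise similarity inside the set plus a penalty on the relevance mass left out) together with independent rounding of a fractional solution, followed by a naive repair step to meet the cardinality constraint. The exact objective, rounding and analysis in the paper may differ; all names here are ours:

```python
import numpy as np

def objective(S, sim, rel, lam):
    # sum of pairwise similarities inside S, plus lam times the
    # relevance of the items excluded from S (one possible penalty)
    idx = np.array(sorted(S))
    pairwise = (sim[np.ix_(idx, idx)].sum() - sim[idx, idx].sum()) / 2.0
    mask = np.zeros(len(rel), dtype=bool)
    mask[idx] = True
    return pairwise + lam * rel[~mask].sum()

def independent_round(x, k, rng):
    # include item i independently with probability x_i, then repair
    # to exactly k items, preferring items with larger fractional value
    chosen = rng.random(len(x)) < x
    order = np.argsort(-x)
    S = [int(i) for i in order if chosen[i]][:k]   # trim any surplus
    for i in order:                                 # fill any deficit
        if len(S) == k:
            break
        if int(i) not in S:
            S.append(int(i))
    return S

sim = np.array([[0.0, 1.0], [1.0, 0.0]])
rel = np.array([1.0, 1.0])
```

    Independent rounding keeps the set size only near k in expectation; the paper's contribution is showing the deviation from the constraint can be controlled without resorting to dependent rounding.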

    Recent results and open problems in spectral algorithms for signed graphs

    openaire: EC/H2020/871042/EU//SoBigData-PlusPlus
    In signed graphs, edges are labeled with either a positive or a negative sign. This small modification greatly enriches the representation capabilities of graphs. However, their spectral properties undergo significant changes, introducing new challenges in related optimization problems. In this extended abstract we discuss recent results in spectral methods for signed graph partitioning and community detection, and propose open problems arising in this context.
    Peer reviewed
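    A standard spectral object in this setting is the signed Laplacian, which uses absolute degrees so that negative edges are not cancelled. A minimal sketch: for a balanced signed graph the smallest eigenvalue is zero and the signs of the corresponding eigenvector recover the two sides of the partition (this illustrates the classical construction, not the specific results surveyed in the abstract):

```python
import numpy as np

def signed_laplacian(A):
    # A: symmetric adjacency matrix with entries in {-1, 0, +1};
    # degrees are taken on absolute values, so L is positive semidefinite
    Dbar = np.diag(np.abs(A).sum(axis=1))
    return Dbar - A

# triangle: nodes 0 and 1 are friends, both antagonistic to node 2
A = np.array([[0.0,  1.0, -1.0],
              [1.0,  0.0, -1.0],
              [-1.0, -1.0, 0.0]])
L = signed_laplacian(A)
w, V = np.linalg.eigh(L)                 # eigenvalues in ascending order
labels = (V[:, 0] >= 0).astype(int)      # signs of the bottom eigenvector
```

    For unbalanced graphs the smallest eigenvalue is strictly positive, and how to best exploit the spectrum is exactly where the open problems mentioned above arise.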

    Column subset selection in practice: efficient heuristics and regularization

    Today, data are available at an unprecedented scale. An overwhelming quantity of Internet-connected devices generates a constant stream of information all over the world, much of which is processed in real time or stored for later use. Making sense of these enormous data sets is often a challenging endeavour. Their size demands the use of massive computational resources, which motivates the design of efficient algorithms. Additionally, these data usually contain measurements of a large number of variables, which poses a wide variety of problems. To address the latter, a family of techniques commonly referred to as dimensionality reduction is studied. In this thesis we address the problem of feature selection, a subset of dimensionality reduction methods that preserve the semantic meaning of the original data variables. To do so, we analyze a problem formulation known as column subset selection. A significant advantage of column subset selection is that the models it produces are simple and in some cases easy to interpret. In an age where notable advances in applied computer science are met with growing concerns about ethics and transparency, model simplicity can become a key requirement in many scenarios. The column subset selection problem has received significant attention in the computer science literature over the last few years, mainly from a theoretical perspective. Here we analyze the problem from a more practical standpoint. Our contributions can be summarized as follows. First, we propose the use of a local search heuristic. We show empirically that it outperforms existing algorithms and derive elementary approximation guarantees. Furthermore, we take advantage of the nature of the problem formulation to derive an efficient implementation suitable for practical use. Second, we introduce regularized formulations of the problem. We derive a greedy algorithm for these new objectives and demonstrate empirically that it produces improved subsets with respect to multiple criteria.
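    To illustrate the kind of local search heuristic the abstract refers to, the sketch below minimizes the Frobenius reconstruction error of a column subset by single-column swaps. It is a generic, unoptimized version under our own naming, not the thesis's efficient implementation:

```python
import numpy as np

def css_error(A, S):
    # squared Frobenius error of projecting A onto the span of columns S
    C = A[:, list(S)]
    P = C @ np.linalg.pinv(C)
    return np.linalg.norm(A - P @ A) ** 2

def local_search_css(A, k, max_iter=100):
    # start from an arbitrary subset and swap single columns
    # while any swap strictly reduces the reconstruction error
    n = A.shape[1]
    S = list(range(k))
    best = css_error(A, S)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for j in range(n):
                if j in S:
                    continue
                T = S.copy()
                T[i] = j
                e = css_error(A, T)
                if e < best - 1e-12:
                    S, best, improved = T, e, True
        if not improved:
            break
    return sorted(S), best

rng = np.random.default_rng(0)
A = rng.standard_normal((10, 6))
S, err = local_search_css(A, k=2)
```

    Each pass costs one pseudoinverse per candidate swap; exploiting the structure of the projection to avoid recomputing it from scratch is the kind of refinement the thesis develops for practical use.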

    Reconciliation k-median

    openaire: EC/H2020/654024/EU//SoBigData
    We propose a new variant of the k-median problem, where the objective function models not only the cost of assigning data points to cluster representatives, but also a penalty term for disagreement among the representatives. We motivate this novel problem by applications where we are interested in clustering data while avoiding selecting representatives that are too far from each other. For example, we may want to summarize a set of news sources, but avoid selecting ideologically extreme articles in order to reduce polarization. To solve the proposed k-median formulation we adopt the local-search algorithm of Arya et al. [2]. We show that the algorithm provides a provable approximation guarantee, which becomes constant under a mild assumption on the minimum number of points for each cluster. We experimentally evaluate our problem formulation and proposed algorithm on datasets inspired by the motivating applications. In particular, we experiment with data extracted from Twitter, the US Congress voting records, and popular news sources. The results show that our objective can lead to choosing less polarized groups of representatives without significant loss in representation fidelity.
    Peer reviewed
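    One plausible concrete form of such an objective adds the pairwise distances among the chosen representatives, weighted by a trade-off parameter, to the usual k-median assignment cost. The sketch below is our own illustration of that shape, not the exact formulation or weighting from the paper:

```python
import numpy as np

def reconciliation_cost(points, S, lam=1.0):
    # k-median assignment cost plus a disagreement penalty:
    # lam scales the sum of pairwise distances among the representatives
    reps = points[S]
    assign = np.linalg.norm(points[:, None, :] - reps[None, :, :],
                            axis=2).min(axis=1).sum()
    pairwise = np.linalg.norm(reps[:, None, :] - reps[None, :, :],
                              axis=2).sum() / 2.0
    return assign + lam * pairwise

# two coincident points and one distant point
points = np.array([[0.0, 0.0], [0.0, 0.0], [4.0, 0.0]])
```

    With lam = 0 this reduces to plain k-median; increasing lam pushes the solution toward representatives that sit close together, which is the polarization-avoidance effect described above.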